Predicting fish weight from other highly correlated covariates

Thomas N. Larsen and Sindre B. Remman

Report outline

Introduction

The dataset - Fish Market

Our chosen dataset contains 159 samples from 7 different species of fish [2]. With this dataset, our aim is to create a predictive model using multiple linear regression to estimate the weight of the fish.

| Covariate | Description | Scale |
| --- | --- | --- |
| Species | Name of the fish | categorical |
| Weight | Weight of the fish (the response) | g |
| Length1 | Vertical length | cm |
| Length2 | Diagonal length | cm |
| Length3 | Cross length | cm |
| Height | Height | cm |
| Width | Diagonal width | cm |

Below, the correlation matrix is shown for continuous variables. Looking at the correlation matrix, it is clear that the covariates suffer from multicollinearity. Notice that the covariates regarding the lengths of the fish are nearly perfectly correlated. We can also see from the pairplot below that we have some cases of heteroscedasticity in the data.
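The correlation matrix itself can be computed directly with pandas. A minimal sketch, using synthetic rows shaped like the Fish Market columns (the real data would be loaded from the Kaggle CSV), shows the near-perfect correlation between the length measurements:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Fish Market data: the length columns are
# constructed as near-linear functions of each other, mimicking the
# multicollinearity observed in the real dataset.
rng = np.random.default_rng(0)
length1 = rng.uniform(10, 40, size=50)
df = pd.DataFrame({
    "Length1": length1,
    "Length2": length1 * 1.08 + rng.normal(0, 0.2, 50),  # near-duplicate
    "Length3": length1 * 1.15 + rng.normal(0, 0.3, 50),
    "Height": length1 * 0.30 + rng.normal(0, 1.0, 50),
    "Width": length1 * 0.15 + rng.normal(0, 0.5, 50),
})

corr = df.corr()
print(corr.round(2))  # Length1-Length3 correlations are close to 1
```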

Data preprocessing

To prepare the data, we separate the covariates, $X$, from the responses, $y$, from the full dataset. To accommodate the categorical variable $\textit{Species}$, we use one-hot encoding. This covariate is therefore split into 7 columns of boolean values. Then, the data is split into training and test sets. The normalization parameters are fit to the training data only, such that we avoid peeking at the test data in normalization.
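The preprocessing steps above can be sketched as follows. The dataframe here is a small synthetic stand-in (column names assumed from the Kaggle dataset); the key point is that the scaler is fit on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Fish Market dataframe
df = pd.DataFrame({
    "Species": ["Bream", "Roach", "Pike", "Perch"] * 10,
    "Weight": np.linspace(100, 900, 40),
    "Length1": np.linspace(20, 40, 40),
})

# One-hot encode Species; separate covariates X from response y
X = pd.get_dummies(df.drop(columns="Weight"), columns=["Species"])
y = df["Weight"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit normalization parameters on the training data only,
# then reuse the training mean/std to transform the test data
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```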

The one-hot encoded $\textit{Species}$ are included in the new correlation matrices below. Naturally, the species seem to group the data into narrow ranges of the other physical attributes, i.e., $\textit{Species}$ may be an important predictor for the weight.

Analysis

In this section we apply Ordinary Least Squares, Ridge regression, Lasso regression, and Sparse Group Lasso regression to our data. The aim is to see how the different methods handle the multicollinearity in the data. Each method is briefly introduced, and their results are compared in the Results section. Before applying these methods, we first investigate how to handle the heteroscedasticity in our data.


Presence of heteroscedasticity

Heteroscedasticity may impact the regression models' predictive performance. One way to dampen heteroscedasticity is to transform the weight response into log-weights. To evaluate the effect, we fit OLS models to normal weights and log-weights and look at the Q-Q plots of the residuals.

Comparing the two residual Q-Q plots, it seems that the log-transformed responses give more normally distributed residuals. To verify this, we can apply some additional normality tests using the R-inspired statsmodels.formula.api [3].
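A sketch of this check, on synthetic stand-in data with a roughly cubic length-weight relationship (the real model would use all covariates): we fit OLS with and without the log-transform and read off the Jarque-Bera statistic from the residuals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera

# Synthetic stand-in: weight grows roughly as length^3 with
# multiplicative noise, mimicking the heteroscedastic fish data
rng = np.random.default_rng(1)
length = rng.uniform(15, 45, 100)
weight = 0.01 * length**3 * np.exp(rng.normal(0, 0.1, 100))
df = pd.DataFrame({"Weight": weight, "Length1": length})

for formula in ["Weight ~ Length1", "np.log(Weight) ~ Length1"]:
    res = smf.ols(formula, data=df).fit()
    jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
    print(f"{formula}: Prob(JB) = {jb_pvalue:.4f}")
# res.summary() prints the full table with Omnibus, Durbin-Watson, etc.
```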

The automated normality tests above give the following results:

Without log-weight transformation:

| Method | Value | Method | Value |
| --- | --- | --- | --- |
| Omnibus: | 7.651 | Durbin-Watson: | 0.634 |
| Prob(Omnibus): | 0.022 | Jarque-Bera (JB): | 6.008 |
| Skew: | 0.533 | Prob(JB): | 0.0496 |
| Kurtosis: | 2.278 | Cond. No. | 595. |

With log-weight transformation:

| Method | Value | Method | Value |
| --- | --- | --- | --- |
| Omnibus: | 12.231 | Durbin-Watson: | 0.717 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 14.030 |
| Skew: | -0.736 | Prob(JB): | 0.000898 |
| Kurtosis: | 4.305 | Cond. No. | 595. |

In these summaries, the Prob(JB) values are p-values for the Jarque-Bera test, whose null hypothesis is that the residuals are normally distributed. For the residuals to pass as normal, these p-values should exceed the significance level, typically chosen around 0.05. The log-transform clearly reduces the p-value, so by this test it makes the residuals less normal, not more. From this analysis alone, log-transforming the response does not appear helpful. Nevertheless, we also compare the cross-validation performance of Ordinary Least Squares with and without the log-transform to investigate the effect on the model's predictive power.

Least Squares

Ordinary least squares (OLS) will be used as a benchmark against the regularized methods. In short, OLS constructs an estimator $\hat{\beta}$ by:

$\hat{y} = X^T\hat{\beta}$, where $\hat{\beta} = \arg\min_{\beta} ||X^T\beta - y||_2^2$.

To find $\hat{\beta}$, we use the implementation in sklearn.linear_model.LinearRegression(), and apply 10-fold cross-validation using sklearn.model_selection.cross_validate() to evaluate the model.
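A minimal sketch of this benchmark, on synthetic data standing in for the fish covariates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression problem with 159 samples, like the fish dataset
rng = np.random.default_rng(2)
X = rng.normal(size=(159, 5))
y = X @ np.array([3.0, 1.5, 0.0, 2.0, -1.0]) + rng.normal(0, 0.5, 159)

# 10-fold cross-validation of plain OLS, scored by R2
cv = cross_validate(LinearRegression(), X, y, cv=10, scoring="r2")
print(f"mean R2 = {cv['test_score'].mean():.3f} "
      f"+/- {cv['test_score'].std():.3f}")
```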

As mentioned above, we also run cross-validation with and without the log-transformed response variable to determine whether this transformation is beneficial.

As we can see from these results, the cross-validation results are significantly better when the response is log-transformed. Therefore, we continue to use this log-transformed response throughout the rest of the report.

Functions for cross-validating a range of regularizer values.

In the subsequent regularized methods, some hyperparameter tuning is required to determine the optimal level of regularization. Therefore, we need to create some functions that cross-validate a range of hyperparameters.
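One way such a function could look (a sketch, not the report's exact implementation): it takes a model factory and a grid of regularizer values, and returns the per-value mean and standard deviation of the cross-validated R2 scores.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def cv_over_alphas(make_model, alphas, X, y, cv=10):
    """Cross-validate a model over a grid of regularizer values.
    Returns the per-alpha mean and std of the R2 scores."""
    means, stds = [], []
    for alpha in alphas:
        scores = cross_val_score(make_model(alpha), X, y,
                                 cv=cv, scoring="r2")
        means.append(scores.mean())
        stds.append(scores.std())
    return np.array(means), np.array(stds)

# Usage sketch with Ridge on synthetic data
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 0.3, 100)

alphas = np.logspace(-3, 3, 13)
means, stds = cv_over_alphas(lambda a: Ridge(alpha=a), alphas, X, y)
best_alpha = alphas[means.argmax()]
```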

Ridge regression

Regularizes the models by limiting the coefficients' size according to: $\min_\beta ||X^T\beta - y||_2^2 + \alpha ||\beta||_2^2$

We first find the mean and standard deviations of the R2 scores according to the function defined above. We then select the regularization hyperparameter that gave the best mean score during the cross-validation.

The first plot shows the R2 scores as a function of the regularization hyperparameter, where the blue vertical line shows the $\alpha$ that gave the best mean R2-score. The second plot shows how the coefficients change with increasing $\alpha$.

In the coefficient plot, we can see that the coefficients for the almost perfectly correlated variables $\textit{Length1}$ and $\textit{Length2}$ are poorly determined: $\textit{Length2}$ has a large positive value, which the coefficient of $\textit{Length1}$ cancels out with a correspondingly large negative value. As the regularization hyperparameter increases, these coefficients shrink, and the problem of large, unstable coefficients is alleviated, as described in [1, p. 63].
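This cancellation effect is easy to reproduce on synthetic data: with two nearly identical columns, an almost unregularized fit can produce large opposing coefficients, while a moderate ridge penalty pulls them toward a shared, stable value.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two near-duplicate columns, mimicking Length1 and Length2
rng = np.random.default_rng(4)
x = rng.normal(size=200)
X = np.column_stack([x, x + rng.normal(0, 1e-3, 200)])
y = 2 * x + rng.normal(0, 0.1, 200)

for alpha in [1e-8, 1.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:g}: coef = {np.round(coef, 2)}")
# With alpha=1.0 the two coefficients settle near 1 each, so that
# their sum still recovers the true total effect of about 2.
```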

Lasso

Regularizes the models by limiting the coefficients' size according to: $\min_\beta \frac{1}{2n_{samples}}||X^T\beta - y||_2^2 + \alpha ||\beta||_1$

The same procedure as used when finding the best hyperparameter for Ridge regression is followed below. There is a similar interaction between the coefficients for $\textit{Length1}$ and $\textit{Length2}$ as for Ridge regression, but their development as the regularization increases is considerably less smooth than for Ridge.

Sparse Group Lasso

The Sparse Group Lasso estimator is defined by:

$\hat{\beta} = \arg\min_{\beta} || \sum_{g\in \mathcal{G}} \mathbf{X}_g \beta_g - \mathbf{y}||_2^2 + \lambda_1 ||\beta||_1 + \lambda_2 \sum_{g\in\mathcal{G}}\sqrt{d_g}||\beta_g||_2$ [4]

The sparse group lasso regularizer is a combination of standard lasso and group lasso regularization. This also means that there are two hyperparameters to tune.
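The combined penalty can be evaluated directly, which makes the structure concrete. A sketch with a small coefficient vector split into groups (dimensions chosen purely for illustration):

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam1, lam2):
    """lam1 * ||beta||_1  +  lam2 * sum_g sqrt(d_g) * ||beta_g||_2"""
    l1 = lam1 * np.abs(beta).sum()
    group = sum(
        np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups
    )
    return l1 + lam2 * group

beta = np.array([3.0, -4.0, 1.0])
groups = [np.array([0, 1]), np.array([2])]     # d_g = 2 and d_g = 1
penalty = sparse_group_lasso_penalty(beta, groups, lam1=1.0, lam2=1.0)
# L1 part: |3|+|4|+|1| = 8;  group part: sqrt(2)*5 + 1*1
```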

The first hyperparameter controls the group regularization, denoted in the code by "group_reg", and the second controls the $L_1$ regularization, denoted by "l1_reg". We therefore need a 2D grid search with cross-validation to determine the best pair of hyperparameters.

Since the one-hot encoding produced seven boolean columns indicating the species of the fish, we group these columns together. The other covariates are left ungrouped, which means they are only regularized by the standard lasso penalty.

We plot the mean cross-validated R2 score on a surface to determine the best set of hyperparameters, indicated by a red dot.

We now select the regularization hyperparameters that gave the best mean CV-score and fit our model.
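The selection step amounts to finding the maximum on the 2D grid of mean CV scores. A sketch with made-up score values (the real grid would come from the cross-validation above):

```python
import numpy as np

# Hypothetical grid of mean CV R2 scores; rows index group_reg,
# columns index l1_reg (values invented for illustration)
group_regs = np.logspace(-3, 0, 4)
l1_regs = np.logspace(-3, 0, 4)
mean_scores = np.array([[0.91, 0.93, 0.90, 0.85],
                        [0.92, 0.96, 0.94, 0.88],
                        [0.90, 0.95, 0.93, 0.87],
                        [0.80, 0.84, 0.82, 0.75]])

# Convert the flat argmax into (row, column) grid indices
i, j = np.unravel_index(mean_scores.argmax(), mean_scores.shape)
best_group_reg, best_l1_reg = group_regs[i], l1_regs[j]
print(f"best group_reg={best_group_reg:g}, l1_reg={best_l1_reg:g}")
```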

Results

In this section we compare the methods' CV performance and apply the best performing method on the test data. Finally, we compare the models' coefficient estimates.

Comparing cross-validation results of the models used in this paper

| Model | Mean CV R2-score | Std. dev. CV R2-score |
| --- | --- | --- |
| OLS | 0.954 | 0.034 |
| Ridge | 0.960 | 0.030 |
| Lasso | 0.959 | 0.030 |
| Sparse group lasso | 0.960 | 0.030 |

As can be seen from the table above, Ridge regression obtained the best cross-validation results in our experiments (tied with sparse group lasso to three decimals). Therefore, we now test this model on the test data.

Applying the methods to the test data

Typically, one would expect a worse test set performance compared to the training CV score. In our case, the R2-score on the test set is actually better than the mean from the cross-validation. The score on the test set is, however, still within one standard deviation from the mean score of the cross-validation performance. While this is uncommon, it is not unreasonable considering the small sample size and accurate model.

Comparing model coefficients

To further compare the different models, we will now plot the coefficients of the different models used in this report. These coefficients correspond to the best performing model for each method in the cross-validation.

Discussing which method is best for the dataset

Note that only lasso picks a single one of the highly correlated length covariates while forcing the others to zero. In contrast, both ridge and sparse group lasso set the coefficients of the three lengths to approximately the same value.

Lasso achieves only marginally worse predictive power than Ridge and sparse group lasso, yet it ignores 5 of the 12 covariates, and therefore predicts with a significantly simpler model than the other methods. If we had applied the one-standard-error rule when selecting Lasso's hyperparameter, even more of the covariates would have been dropped from the model.
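The one-standard-error rule can be sketched as follows: among all regularizer values whose mean CV score is within one standard error of the best score, pick the most regularized one (scores and standard errors here are invented for illustration; alphas are assumed sorted from weakest to strongest regularization).

```python
import numpy as np

# Hypothetical CV results over a grid of regularization strengths
alphas = np.array([0.001, 0.01, 0.1, 1.0, 10.0])
means = np.array([0.959, 0.960, 0.958, 0.950, 0.900])
ses = np.array([0.010, 0.009, 0.010, 0.012, 0.020])

# One-standard-error rule: strongest regularization whose mean score
# is still within one SE of the best mean score
best = means.argmax()
threshold = means[best] - ses[best]
one_se_alpha = alphas[means >= threshold].max()
print(one_se_alpha)
```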

References

[1] Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

[2] Dataset: https://www.kaggle.com/aungpyaeap/fish-market

[3] Normality testing: https://www.math.ntnu.no/emner/IST100x/ISTx1003/Regresjon.html#Simulert_eksempel_(forts)31

[4] Moe, Y. M. (2019), Group-lasso docs https://group-lasso.readthedocs.io/